Gabriela Palomo gabriella.palomo@gmail.com
Hannah Griebling hgriebli@mail.ubc.ca
After today’s lecture, you’ll be able to:
I highly recommend that you use RStudio Projects instead of setwd().
Go to RStudio and click on File > New Project.
For our own project, let’s go ahead and choose New Directory and let’s name our project: 2024-data-manipulation-UWIN.
You will have a series of directories inside your project, depending on the type of work you are doing. Some people recommend following the same structure that you would use when creating an R package. However, I think that at a minimum, you could have the following structure:
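As a rough sketch of how such a layout could be set up from R: the folder names below (data, scripts, figures, outputs) are assumptions chosen for illustration, not a prescribed structure, and a temporary directory stands in for the real project folder.

```r
# Stand-in for the project root; inside a real RStudio Project this would be "."
root <- file.path(tempdir(), "2024-data-manipulation-UWIN")

# Hypothetical folder names, invented for this example
folders <- c("data", "scripts", "figures", "outputs")

# Create each folder under the project root
for (f in folders) {
  dir.create(file.path(root, f), recursive = TRUE, showWarnings = FALSE)
}

# List the top-level folders that now exist
created <- list.dirs(root, full.names = FALSE, recursive = FALSE)
print(created)

# Inside a project, paths are written relative to the project root,
# so no setwd() call is needed (the file name here is hypothetical)
data_path <- file.path("data", "detections.csv")
```

Because every path is relative to the project root, the same script should keep working when the project folder is moved or shared.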
File names should be machine readable: avoid spaces, symbols, and special characters. Don’t rely on case sensitivity to distinguish files.
File names should be human readable: use file names to describe what’s in the file.
File names should play well with default ordering: start file names with numbers so that alphabetical sorting puts them in the order they get used.
Why are these good names? Because if you have several such files, you can sort them by date (ascending or descending) or by figure number (fig-01, fig-02).
Warning
It’s important to note that fig-01.png is not the same as fig-1.png. Without zero padding, your computer will read the files in this order: fig1.png, fig10.png, fig11.png, fig2.png.
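The ordering problem can be reproduced in R. This is a minimal sketch that assumes plain character-by-character sorting (method = "radix", which sorts in the C locale); a file browser may use slightly different rules.

```r
# File names without zero padding
unpadded <- c("fig1.png", "fig2.png", "fig10.png", "fig11.png")
sorted_unpadded <- sort(unpadded, method = "radix")
print(sorted_unpadded)

# Zero-padded names built with sprintf(): %02d pads to two digits
padded <- sprintf("fig-%02d.png", c(1L, 2L, 10L, 11L))
sorted_padded <- sort(padded, method = "radix")
print(sorted_padded)
```

With padding, alphabetical order and numerical order agree, so fig-02.png comes before fig-10.png.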
In the beginning there was only one pipe operator, %>%, which comes from the magrittr package.
The idea is to have a way to pipe an object forward into a function or call expression.
It should be read as ‘then’. For example, df %>% select(col1) is read as follows: start with object df, THEN select col1.
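Now, base R has its own pipe, called the native pipe, |>, which is also read as ‘then’. It is available in R 4.1.0 and later.
In RStudio, you can make the pipe keyboard shortcut (Ctrl/Cmd + Shift + M) insert the native pipe by going to Tools > Global Options > Code and selecting “Use native pipe operator, |>”.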
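A small sketch comparing the two pipes. The data frame df and its columns col1 and col2 are made up for illustration, and the example assumes dplyr is installed (loading dplyr also makes %>% available) and that R is version 4.1 or later.

```r
library(dplyr)  # provides select() and re-exports the magrittr pipe %>%

# Hypothetical data frame for illustration
df <- data.frame(col1 = 1:3, col2 = c("a", "b", "c"))

# magrittr pipe: start with df, THEN select col1
res_magrittr <- df %>% select(col1)

# native pipe: read the same way
res_native <- df |> select(col1)

# Both are equivalent to the nested call select(df, col1)
print(identical(res_magrittr, res_native))
```

For simple chains like this one the two pipes behave the same; they differ in some advanced features, such as how placeholders work.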
dplyr verbs: data transformation
dplyr is a package based on a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:
mutate() adds new variables that are functions of existing variables
select() picks variables based on their names
filter() picks cases based on their values
summarise() reduces multiple values down to a single summary
arrange() changes the ordering of the rows
group_by() groups variables for you to perform operations on the grouped data. Always remember to ungroup() once you are finished
These can be linked together by pipes |> or %>%
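The verbs above can be chained in a single pipeline. The sketch below uses a small invented data set (the detections data frame and its columns are hypothetical) and assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical detection data, invented for illustration
detections <- data.frame(
  site    = c("A", "A", "B", "B", "C"),
  species = c("coyote", "raccoon", "coyote", "coyote", "raccoon"),
  count   = c(2, 5, 1, 4, 3),
  nights  = c(10, 10, 8, 8, 12)
)

rate_by_species <- detections |>
  filter(count > 1) |>                    # keep rows with more than one detection
  mutate(rate = count / nights) |>        # new variable from existing ones
  select(site, species, rate) |>          # keep only the columns we need
  group_by(species) |>                    # operate per species
  summarise(mean_rate = mean(rate), n_rows = n()) |>
  ungroup() |>                            # drop any remaining grouping
  arrange(desc(mean_rate))                # highest mean rate first

print(rate_by_species)
```

Each line reads as one step: start with detections, THEN filter, THEN mutate, and so on.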
Cool cheatsheet for dplyr
tidyr for tidying data
The tidyr package has a series of functions that are named after verbs that will help you tidy and clean data.
The goal of tidyr is to help you create tidy data. Tidy data is data where:
Each variable is a column; each column is a variable
Each observation is a row; each row is an observation
Each value is a cell; each cell is a single value
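To illustrate the three rules, here is a minimal sketch that reshapes a small wide table into tidy form with pivot_longer(). The wide data frame and its column names are invented for the example, and tidyr is assumed to be installed.

```r
library(tidyr)

# Hypothetical wide table: one column per species, so the variable
# "species" is spread across column names instead of being a column
wide <- data.frame(
  site    = c("A", "B"),
  coyote  = c(2, 1),
  raccoon = c(5, 0)
)

# Tidy form: one row per site-species observation
long <- wide |>
  pivot_longer(
    cols      = c(coyote, raccoon),  # columns to stack
    names_to  = "species",           # old column names become a variable
    values_to = "count"              # cell values go into one column
  )

print(long)
```

In the long table each variable (site, species, count) is a column and each observation is a row.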
Cool cheatsheet for tidyr
UWIN R Workshops - March 2024